Prediction Models for Osteoporotic Fractures Risk: A Systematic Review and Critical Appraisal

Osteoporotic fractures (OF) are a global public health problem. Many risk prediction models for OF have been developed, but their performance and methodological quality are unclear. We conducted this systematic review to summarize and critically appraise OF risk prediction models. Three databases were searched until April 2021. Studies developing or validating multivariable models for OF risk prediction were considered eligible. The prediction model risk of bias assessment tool was used to appraise the risk of bias and applicability of the included models. All results were narratively summarized and described. A total of 68 studies describing 70 newly developed prediction models and 138 external validations were included. Most models were explicitly developed (n=31, 44%) and validated (n=76, 55%) only for females. Only 22 developed models (31%) were externally validated. The most frequently validated tool was the Fracture Risk Assessment Tool. Overall, only a few models showed outstanding (n=3, 1%) or excellent (n=32, 15%) discrimination. Calibration was rarely assessed, either for developed models (n=25, 36%) or for external validations (n=33, 24%). No model was rated as low risk of bias, mostly because of an insufficient number of cases and inappropriate assessment of calibration. A number of OF risk prediction models exist. However, few have been thoroughly internally or externally validated (with calibration unassessed for most of them), and all models showed methodological shortcomings. Instead of developing completely new models, future research should validate, improve, and analyze the impact of existing models.

recommended to use prediction models integrating several risk factors to identify individuals at high risk of OF [9].
At present, numerous prediction tools for OF have been developed, including but not limited to the World Health Organization (WHO) Fracture Risk Assessment Tool (FRAX) algorithm [10], the QFracture algorithm [11], and the Garvan Fracture Risk Calculator (Garvan) [12]. Some of them have been recommended in clinical guidelines for treatment management [13,14] and are increasingly advocated by health policymakers. Although there are some systematic reviews on OF prediction models [15-17], they are outdated, with the latest literature search having been performed in 2017 [16]. Further limitations include restriction to a few specific tools [17] or to a certain population such as women [15], or the absence of a critical appraisal of the included models with standardized criteria [16,17]. Hence, an updated systematic review of prediction models for OF is needed.
We conducted this systematic review and critical appraisal to summarize the characteristics of the development and validation of OF risk prediction models, assess their methodological quality and reporting quality, and provide up-to-date evidence for clinical implementation and future research.

METHODS
This systematic review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [18]. The protocol was registered in PROSPERO (registration number: CRD42020199196).

Search strategy
We systematically searched PubMed, Embase, and PsycINFO from inception to April 3, 2021. In addition, the reference lists of included studies were manually reviewed. The search strategy included the key concepts of i) osteoporotic fractures and osteoporosis and ii) risk prediction and related terms. The detailed search strategies are presented in Supplementary table 1.

Eligibility criteria
Cohort studies that developed or validated risk prediction models for OF in the general population were considered eligible. Studies were excluded if i) the prediction model consisted of only one predictor; ii) they targeted secondary OF or focused on specific patient groups for the treatment of OF or related conditions; iii) the performance of the model was not reported; iv) they were reviews, conference abstracts, letters or protocols. In addition, if the development article was not available, the corresponding external validation articles were excluded.

Literature selection
Two reviewers (HW and JS) independently selected the studies, determined eligibility, and resolved discrepancies by consensus. When a disagreement could not be resolved, a third reviewer (LQ) was invited to make the final decision.

Data extraction
Two reviewers (XS and YC) independently extracted the data with a pre-developed data extraction form, which followed the guidance of the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) checklist [19]. The following information was extracted from each included study: i) characteristics of the study (e.g., study design, data source); ii) data related to participants (e.g., country or region of participants, age, gender, events per variable (EPV)); iii) details about model development and validation (e.g., type of prediction model, predictors included in the model, modelling method) and model performance.
When a study included multiple different models, for example separate models for men and women or separate models for different outcomes (e.g., hip fracture, major osteoporotic fractures (MOF)), each model was included separately. When multiple versions (e.g., with different risk factors) of a model for the same population and outcome were included in a study, the model with the best performance was selected for data extraction. When an article validated multiple models, separate data extraction was performed for each model.
Model performance was assessed by discrimination and calibration. Discrimination is often quantified by the C index or area under the receiver operating characteristic curve (AUC). A C index or AUC less than 0.5 suggests no discrimination, 0.5 to 0.7 is poor, 0.7 to 0.8 is acceptable, 0.8 to 0.9 is excellent, and higher than 0.9 is outstanding [20]. Calibration can be visualized by a calibration plot and is usually quantified using the calibration intercept and the calibration slope, with a slope close to 1 and an intercept close to 0 indicating good calibration [21]. These measures were extracted from the publications when available. Sensitivity and specificity were extracted as well if available. Additionally, EPV was calculated to gauge the risk of model overfitting. An EPV less than 20 was considered to indicate overfitting for model development, and an EPV less than 100 for model validation [22].
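The thresholds above can be illustrated with a minimal Python sketch. It assumes the usual definition of EPV as the number of outcome events divided by the number of candidate predictors, which this section does not spell out. The function names and the handling of values that fall exactly on a category boundary are illustrative assumptions, not part of the review's method.

```python
# Illustrative sketch only: names and boundary handling are assumptions.

def classify_discrimination(auc: float) -> str:
    """Map a C index or AUC to the qualitative labels quoted in the text [20]."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("C index / AUC must lie between 0 and 1")
    if auc < 0.5:
        return "no discrimination"
    if auc < 0.7:   # assumption: a boundary value goes to the higher category
        return "poor"
    if auc < 0.8:
        return "acceptable"
    if auc <= 0.9:  # the text reserves "outstanding" for values above 0.9
        return "excellent"
    return "outstanding"


def events_per_variable(n_events: int, n_predictors: int) -> float:
    """EPV, assumed here to be outcome events per candidate predictor."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


# EPV cut-offs stated in the text [22].
EPV_THRESHOLDS = {"development": 20, "validation": 100}


def flags_overfitting(epv: float, stage: str) -> bool:
    """Return True when the EPV falls below the cut-off for the given stage."""
    return epv < EPV_THRESHOLDS[stage]
```

For instance, a hypothetical development study with 150 fractures and 10 candidate predictors would have an EPV of 15, below the development cut-off of 20.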

Risk of bias and applicability assessment
The risk of bias and applicability of each included study were independently assessed by two reviewers (ZZ and XS) using the prediction model risk of bias assessment tool (PROBAST) [23,24]. Discrepancies were resolved by consensus between the two reviewers, and a third author (YG) was invited to adjudicate when needed. The risk of bias assessment contains four domains: participants, predictors, outcome, and analysis. Each domain was judged as low, high, or unclear risk of bias. The overall risk of bias was summarized according to the following rules: when all four domains were judged as "low" risk of bias, the overall risk of bias was "low"; otherwise, "high" or "unclear" risk of bias was graded accordingly [23,24]. The applicability assessment contains three domains: participants, predictors, and outcome. It follows similar assessment rules and procedures to the risk of bias assessment.
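The overall judgement rule can be sketched as a small Python function. The "low" case follows the rule stated above. The ordering used for the other two cases (any "high" domain gives "high", otherwise any "unclear" domain gives "unclear") is an assumption based on the PROBAST guidance [23,24], and the function and variable names are hypothetical.

```python
from typing import Dict

VALID_RATINGS = ("low", "high", "unclear")
RISK_OF_BIAS_DOMAINS = ("participants", "predictors", "outcome", "analysis")


def overall_risk_of_bias(domain_ratings: Dict[str, str]) -> str:
    """Combine the four domain ratings into one overall rating."""
    missing = [d for d in RISK_OF_BIAS_DOMAINS if d not in domain_ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    ratings = [domain_ratings[d].lower() for d in RISK_OF_BIAS_DOMAINS]
    for rating in ratings:
        if rating not in VALID_RATINGS:
            raise ValueError(f"unknown rating: {rating}")
    if "high" in ratings:     # assumption: one high-risk domain is enough
        return "high"
    if "unclear" in ratings:  # assumption: unclear only if no domain is high
        return "unclear"
    return "low"              # all four domains were rated low
```

Under this sketch, a model rated low in three domains but high in the analysis domain receives a high overall rating.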

Statistical Analysis
All results were narratively summarized and described without any quantitative synthesis due to variation in predictors and characteristics of participants among the included prediction models.

Aging and Disease • Volume 13, Number 4, August 2022 1218

Study selection
The literature search identified 2852 records, of which 784 were removed as duplicates and 1882 were excluded based on title and abstract. A total of 186 full texts were assessed, of which 68 articles met the eligibility criteria and were included in this review (Fig. 1). In total, 38 articles reported the development of one or more OF risk prediction models, and 44 articles described one or more external validations of OF risk prediction models. Because some articles combined development and external validation, these counts do not sum to 68.
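The selection counts can be checked with simple arithmetic, sketched below in Python. The overlap of articles reporting both development and external validation is an inference, assuming every included article reports at least one of the two; the review does not state this number directly. The variable names are illustrative.

```python
# Counts reported in the study selection paragraph.
records_identified = 2852
duplicates_removed = 784
excluded_on_title_abstract = 1882
included_articles = 68
development_articles = 38
validation_articles = 44

# Records remaining for full-text assessment.
full_texts_assessed = (
    records_identified - duplicates_removed - excluded_on_title_abstract
)

# Full texts excluded after assessment.
full_texts_excluded = full_texts_assessed - included_articles

# Inclusion-exclusion, assuming each included article reports a development,
# an external validation, or both.
articles_with_both = (
    development_articles + validation_articles - included_articles
)

print(full_texts_assessed)  # 186
print(full_texts_excluded)  # 118
print(articles_with_both)   # 14
```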

Sample size
The sample size of the included models ranged from 405 to 12,011,134, and the incidence of fracture ranged from 0.1% to 31.4%. The EPV ranged from 0.1 to 6,613.3. Of the 70 models, 30 (43%) had an EPV less than 20, indicating a risk of overfitting (Table 1 and Table 2).

Model presentation
Only 39 (56%) models were presented as a web calculator, nomogram, or risk score for each predictor to allow practical use, while the remaining 31 (44%) models offered no such presentation.

Risk of bias and applicability
All 70 models were judged to be at high overall risk of bias. In the outcome domain, 31 (44%) models had an unclear risk of bias and 10 (14%) had a high risk of bias, mainly because it was unclear whether a prespecified or standard outcome definition, or subjective outcome measures (e.g., self-reported), had been used. All models (n=70, 100%) were at high risk of bias in the analysis domain, commonly because of the risk of overfitting caused by an insufficient number of cases or because of the categorization of continuous predictors. In addition, the calibration of many models was not assessed or was assessed incorrectly (e.g., using the Hosmer-Lemeshow test). In terms of applicability, 44 (63%) models had a low concern while the remaining 26 (37%) had a high concern. The most common applicability concern was in the outcome domain, where models focused on hip fracture; such models may not accurately predict all osteoporotic fractures. Details on the risk of bias and applicability assessments are presented in Figure 2 and Supplementary table 3.

Sample size
The sample size ranged from 412 to 1,136,417, and the incidence of fracture ranged from 0.1% to 22.1%. The EPV ranged from 0.1 to 16,312.8, and 114 (83%) of the 138 external validations had an EPV less than 100, indicating a risk of overfitting (Table 1 and Table 2).

Risk of bias and applicability
Most models (n=126, 91%) were judged to be at high overall risk of bias, while the remaining 12 (9%) were at unclear risk of bias; no model at low risk of bias was identified. The most common issues were in the analysis domain, in which 126 (91%) models were rated as high risk of bias, most often because of an insufficient number of cases or incorrect assessment of calibration. Several models had an unclear risk (n=58, 42%) or high risk (n=15, 11%) of bias in the outcome domain, mainly because it was unclear whether a prespecified or standard outcome definition, or subjective outcome measures (e.g., self-reported), had been used. In terms of applicability, 88 (64%) models had a low concern, and the remaining 50 (36%) models had a high concern because they focused on hip fracture in the outcome domain. Details on the risk of bias and applicability assessments are presented in Figure 3 and Supplementary table 4.

Model comparison
FRAX, QFracture, and Garvan were the three most commonly used tools in clinical practice. In addition, some tools with potential clinical value had been externally validated with good performance (e.g., FRA-HS, WHI). Details of the externally validated models, together with their advantages and disadvantages, are summarized in Table 3.

Table 3. Predictors, advantages and disadvantages of externally validated models.
[Table 3 columns: Author; Model; Details of the predictors included in the model]

DISCUSSION
This systematic review summarized and critically appraised 68 studies on OF risk prediction models in the general population, comprising 70 developed models and 138 external validations. Only a few models showed outstanding (n=3, 1%) or excellent (n=32, 15%) discrimination. Few of the developed models (n=22, 31%) had been externally validated, although there were a few notable exceptions, such as FRAX with BMD (for MOF) and FRAX with BMD (for hip fracture). Calibration was rarely assessed, either for developed models (n=25, 36%) or for external validations (n=33, 24%). Moreover, no model was appraised as having a low risk of bias.
We found considerable variability in the geographical location of both model development and model validation. However, the majority of models were developed and validated in the UK, the US, or China. Although studies have shown that osteoporotic fractures are also prevalent in low or middle-income countries [90], no model has been developed or validated in populations from Africa, South America, or the Middle East. Tailored models for populations in these regions are important because it is well known that predictor-outcome associations vary among ethnic groups [91]. In the future, more external validation studies in these uncovered populations are needed to improve the generalizability of existing models, which is also a more cost-effective choice than investing extra research funding in developing new models [92].
Although postmenopausal females are at high risk of OF, the incidence of OF in males increases significantly with age. Furthermore, mortality and disability from OF are higher in males than in females [93]. Osteoporosis is therefore an underestimated bone condition in the male population [94]. Although research progress has been made on OF in males [37,57], we found that most models were developed (n=31, 44%) and validated (n=76, 55%) specifically for females, with relatively fewer models being specifically developed (n=23, 33%) or validated (n=33, 24%) for males. Future studies should pay attention to risk prediction models specific to the male population.
It is worth noting that some models that included only a few predictors (e.g., two or three predictors) [32,35,46] or only easily measured predictors [29] also showed promising performance when compared with models [57] that used multiple complex predictors such as SNPs. Moreover, because of the large number of predictors and the resources required to measure them, the practical application of these complex models (including a large number of SNPs) is limited. On the other hand, as the gold standard for the diagnosis of osteoporosis, BMD has been included in several prediction models [34,35,39,40,46,48]. This review found that many studies showed Garvan and FRAX with BMD had higher discrimination than Garvan and FRAX without BMD [39]. However, we also observed similar or even better performance in models without BMD, such as QFracture [84] and WHI [29], indicating that BMD may not be an essential predictor of future fracture. Hence, increasing the number of predictors or including complex predictors may not necessarily improve model performance. In future studies, complex predictors (e.g., BMD, SNPs) could be replaced by other easily measurable predictors (e.g., age, prior fractures, history of falls) when they are unavailable, difficult to obtain, or show no evidence of improving model performance.
FRAX, QFracture, and Garvan are the three most commonly used models for OF prediction. FRAX (10 or 11 predictors) is a model recommended by the WHO to evaluate the risk of OF [10]. It has strong applicability and operability and has been used worldwide [17]. In this systematic review, we found that FRAX with BMD (for MOF) (n=37, 27%) was the most frequently externally validated model, but its performance was not particularly good. Compared with FRAX alone, the performance of its extended models was slightly improved, but most of them had not been externally validated. Garvan (4 predictors) contained the fewest predictors, which are also easy to measure [12]; this facilitates its practical use. However, the performance of Garvan was relatively poor [16]. QFracture was developed from electronic medical records and showed the best performance among the three models. Nevertheless, its larger number of predictors (26 for males and 25 for females) limits its practical application to a certain extent [11]. Moreover, some models (e.g., FRA-HS) with potential clinical value and good performance [43] had not been externally validated in different populations and were rarely used in clinical practice. As a result, no single one-size-fits-all model can be recommended in this review. Model performance, applicability, and characteristics should be considered when selecting an OF prediction model [16].
Modeling methods include classical regression methods (e.g., Cox proportional hazards regression, logistic regression) and artificial intelligence methods (e.g., machine learning). Classical regression methods generally show lower prediction performance [57]. By comparison, artificial intelligence methods have a powerful capacity for data analysis and exploration, and models developed with them have shown advantages in accuracy, sensitivity, and efficiency [59,95]. In this systematic review, the 7 (10%) models that adopted machine learning methods showed relatively good discrimination. However, artificial intelligence modeling requires large amounts of high-quality data, and the resulting models are prone to overfitting [59]. Nonetheless, with the arrival of the big data era, artificial intelligence methods are finding more applications in the medical field and could be considered a flexible alternative for risk prediction in large datasets.
This systematic review did not consider model impact studies, which quantify the benefits, harms, and costs of introducing a new risk prediction model through a comparative design. Such studies are the final crucial step in determining whether a model can be applied in clinical practice [96,97]. A recent related systematic review identified only three model impact studies on OF [98]. Its results showed that population screening could effectively reduce OF and hip fractures; however, information on costs and screening intervals was still unclear [98]. More rigorous impact studies are needed to determine whether OF risk prediction models should be implemented in clinical practice.

Recommendations and implications
Accurate OF risk evaluation can help clinicians and individuals understand the risk of OF and guide them in making decisions to mitigate that risk [99]. When choosing a model for the prediction of OF risk, its accuracy, applicability, convenience, data availability, and cost should be considered. When developing models, simple models with fewer or easily measured predictors should be prioritized to improve clinical feasibility and practicality. Given the large number of existing models, future studies should prioritize recalibrating and extending existing OF prediction models to improve prediction performance, and conducting external validation and model impact analysis, instead of developing new models from scratch [92].

Strengths and limitations
The strengths of this review include a systematic literature search, rigorous study selection, and detailed data extraction on the main characteristics of OF prediction models. Furthermore, we evaluated the risk of bias and applicability of all the identified models to suggest where improvements are needed in future OF prediction model studies. However, this review also has some limitations. Firstly, because of heterogeneity across studies, the results were not quantitatively synthesized, which limited the comparability of models. Secondly, although we conducted an exhaustive literature search, some relevant citations may have been missed because grey literature was not searched. This may underestimate the number of development and validation models,

Conclusion
In conclusion, our systematic review found that although a number of OF risk prediction models exist, most of the developed models had not been thoroughly internally or externally validated (with calibration unassessed for most of the models). Most of the models also showed poor performance. Moreover, all models suffered from methodological shortcomings. Given the availability of large and combined datasets, more rigorous studies should validate, improve, and analyze the impact of existing OF risk prediction models in the general population rather than develop completely new models. Rigorous studies on OF prediction models that target males and populations in low or middle-income countries are also needed.